Variant Discovery ◾ 121
in whole genome sequencing. However, several studies have shown that retaining PCR and
Illumina clustering duplicates does not cause significant artifacts as long as the library
complexity is sufficient. The duplicate alignments can be removed with “samtools rmdup”.
The following script creates a new subdirectory “dupRemBam” and then removes duplicate
alignments from the BAM files and stores the new BAM file in the new subdirectory:
mkdir dupRemBam
cd sortedbam
for i in $(ls *.bam);
do
samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log
done;
cd ..
4.2.1.1.8 Varian calling with bcftools
It is time to use the bcftools for variant calling. The bcftools is a consensus variant caller as
discussed above. It takes a single or multiple BAM files as input and outputs a compressed
VCF file. The following script creates the subdirectory “variants” in the main project direc-
tory and then it uses “bcftools mpileup” for performing pileup and “bcftools call” to call
the variants for the piled-up reads:
mkdir variants
cd dupRemBam
ls *.bam | rev | rev > bam_list.txt
bcftools mpileup -Ou \
-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \
-b bam_list.txt \
| bcftools call -vmO z \
-o ../variants/sarscov2.vcf.gz
cd ..
The VCF file that contains the variants (SNPs and InDels) will be saved in “variants” sub-
directory. For further analysis, the VCF file can be indexed using “tabix”, which is a generic
indexer for TAB-delimited genome position files.
tabix variants/sarscov2.vcf.gz
We can view the VCF file with “bcftool view” as follows:
bcftools view variants/sarscov2.vcf.gz | less -S
All steps from downloading the raw data to variant calling can be included in a single bash
file that can be executed on the command-line prompt of the Linux terminal. First, create
a new directory, make that directory the current working directory, and then create a file
with the run ids “ids.txt” as described above. Then, create a bash file “pipeline_bcftools.